unique and duplicated warn on other encodings than UTF8 by ben-schwen · Pull Request #7379 · Rdatatable/data.table

ben-schwen · 2025-10-19T21:19:30Z

Closes #469

Not exactly what Arun suggested, but it seems like the best option since we encode to UTF8 in forderv. Is a warning too much here?

codecov · 2025-10-19T21:37:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.12%. Comparing base (59f966c) to head (2f49a9a).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #7379   +/-   ##
=======================================
  Coverage   99.12%   99.12%           
=======================================
  Files          85       85           
  Lines       16637    16640    +3     
=======================================
+ Hits        16492    16495    +3     
  Misses        145      145


jangorecki · 2025-10-20T05:57:15Z

Let's see revdeps. If none affected then I would keep it like this.

aitap · 2025-10-20T19:27:20Z

This may bite fread() users who rely on the default value of the encoding= argument:

``` r
fwrite(list(x = c('a møøse', 'bit my sister once')), 'foo.txt')
fread('foo.txt', sep='\t') |> setkey(x) |> unique() |> _$x |> str()
# chr [1:2] "a møøse" "bit my sister once"
# Warning message:
# In unique.data.table(setkey(fread("foo.txt", sep = "\t"), x)) :
#   Mixed encodings detected. Strings were coerced to UTF-8 before
#   unique(x).
```

A similar problem may happen to users of old versions of R or old versions of Windows where non-ASCII string literals would result in CE_NATIVE strings instead of CE_UTF8.

Should the warning recommend enc2utf8()?

Could someone please share the original context from R-Forge #5758?
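For reference, here is a minimal base-R sketch of the enc2utf8() conversion suggested above. The variable names are hypothetical, and the Latin-1 input is simulated with iconv() purely for illustration; it is not taken from the PR.

```r
# "\u00f8" escapes yield UTF-8-marked strings regardless of the session locale.
x <- c("a m\u00f8\u00f8se", "bit my sister once")

# Simulate input that arrived in another encoding (Latin-1 is an assumption here).
x_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
print(Encoding(x_latin1))

# enc2utf8() re-encodes the marked non-UTF-8 string; the pure-ASCII one is left as is.
x_utf8 <- enc2utf8(x_latin1)
print(Encoding(x_utf8))
```

Pure-ASCII strings always report "unknown" from Encoding(), so only the first element changes its mark.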

MichaelChirico · 2025-10-20T19:34:24Z

> Could someone please share the original context from R-Forge #5758?

Basically, it links to https://stackoverflow.com/questions/24085906/unique-data-table-do-not-handle-keys-properly

jangorecki · 2025-10-20T19:38:57Z

A late warning about unexpected consequences is better than no warning, so despite that I still see value in it.
And what about changing the default of fread as well?

aitap · 2025-10-20T19:53:44Z

Is lower performance meant to be the unexpected consequence? Otherwise we seem to do the same thing as what R does in identical(). Given how most systems already speak UTF-8 as the native encoding, changing the default fread(encoding=) argument to "UTF-8" should be mostly harmless. Only people who read invalid UTF-8 or use a non-UTF-8 system will notice a difference.

ben-schwen · 2025-10-22T09:18:56Z

Good points! I've made "UTF-8" the default for fread(encoding=). Given how much support the addition of encoding='UTF-8' got in #563, this also sounds like what the community wants/needs.

I have also filed this as a breaking change now.
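Whatever the default ends up being, the encoding can be requested explicitly at the call site. The following is a small sketch that assumes data.table is installed; the temporary file and variable names are illustrative only.

```r
library(data.table)

tmp <- tempfile(fileext = ".txt")
# "\u00f8" escapes are stored as UTF-8, so the file contains UTF-8 bytes.
fwrite(list(x = c("a m\u00f8\u00f8se", "bit my sister once")), tmp)

# Passing encoding= explicitly asks fread() to mark non-ASCII strings as UTF-8.
dt <- fread(tmp, sep = "\t", encoding = "UTF-8")
print(Encoding(dt$x))

unlink(tmp)
```

Spelling out encoding= keeps a script's behaviour the same across data.table versions with different defaults.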

aitap · 2025-10-22T09:26:51Z

The warning is not caused by mixed encodings. The previous example sorts a vector of CE_NATIVE strings, which is actually safe to perform without conversion to UTF-8, but the warning still appears.

It would be more precise to say "Detected strings not marked as ASCII or UTF-8 and converted them to UTF-8 for detection of duplicates. Please convert your character columns using enc2utf8() to avoid the on-the-fly conversion."

Could you please explain why the result is unexpected when calling duplicated() or unique() without converting strings to UTF-8 first? After all, identical(enc2utf8('ø'), iconv('ø', to='latin1')) is TRUE, even if those are different CHARSXPs under the hood.
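The identical() point can be checked in base R with a short sketch. It uses a "\u00f8" escape instead of a literal 'ø' so that the result does not depend on the session's native encoding; the variable names are hypothetical.

```r
utf8   <- "\u00f8"                                    # marked as UTF-8
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")  # marked as latin1

# The underlying bytes differ between the two encodings...
print(charToRaw(utf8))
print(charToRaw(latin1))

# ...yet base R treats the two strings as the same value.
print(identical(utf8, latin1))
print(utf8 == latin1)
```

Both comparisons translate the strings before comparing, which is why the differing byte representations do not matter to base R here.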

github-actions · 2025-10-22T09:44:16Z

No obvious timing issues in HEAD=warn_encodings

Generated via commit 2f49a9a

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 43 seconds
Installing different package versions	21 seconds
Running and plotting the test cases	2 minutes and 37 seconds

ben-schwen · 2025-10-22T09:53:39Z

I guess what Jan meant by unexpected consequences is that forderv can simply take longer because users are unaware of encodings.

jangorecki · 2025-10-22T12:17:58Z

Simply taking longer is not a problem. I thought that something that was giving T for == could now start to return F.

add warning for encodings other than utf8 in unique and duplicated

4b8028d

ben-schwen requested a review from MichaelChirico as a code owner October 19, 2025 21:19

jangorecki approved these changes Oct 20, 2025


add UTF-8 as standard encoding

4e95497

ben-schwen added 3 commits October 22, 2025 11:19

remove spilled newline

88b1081

update fread man page

e35d5ad

Merge branch 'master' into warn_encodings

622de2b

ben-schwen added 2 commits October 22, 2025 11:37

fix typo

48e021e

add info about enc2utf8

cc01889

update docs

2f49a9a

ben-schwen wants to merge 8 commits into master from warn_encodings